{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据导入和导出是pandas中很基础且重要的一个部分。pandas提供了很多不同格式数据的导入和导出方法，可以将其他格式数据转为DataFrame格式。我们可以将list、dict格式数据转为dataFrame格式，也可以从本地的csv、json等文本格式数据和sql、MongoDB等数据库中读取和保存数据等等。下面就分别以三大类介绍一些常见的数据格式的导入与导出。\n",
    "下文中所有的示例代码都是在jupyter notebook中创作，还不太了解jupyter的小伙伴，可以先看看这篇文章哦：数据科学 | 始于Jupyter Notebooks：一份全面的初学者使用指南（链接：https://mp.weixin.qq.com/s/QhvKxWp9xR_Ui3YHCKX3gQ）。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 np.array、list、dict格式数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一般读取一个list，生成的结果如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame([1,2,3,4])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果读取的list中的每个元素都是一个元组，会发生什么呢？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame([(1,2,3,4),(2,3,4,5)],columns = ['value1','value2','value3','value4'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果忽略columns的话，第二个list的值不是列名，而是默认生成索引名，如下："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame([(1,2,3,4),(2,3,4,5)],['value1','value2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里我们以一个字典为数据，看下不同操作的结果有何不同。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = {'a':[1,2],'b':[2,3]}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "直接调用DataFrame进行读取的话，生成的DataFrame结构如下:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame.from_dict(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "需要说明的是：from_dict这个方法只有在pandas 0.23版本后才有，如果在早期的版本如0.19中调用会出现报错。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果我们想以a，b作为索引，以list中的每个值分别为一列怎么操作呢？"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame.from_dict(data,orient='index',columns = ['value1','value2'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如果进一步想让a、b生成列的话，调用reset_index方法即可。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame.from_dict(data,orient='index',columns = ['value1','value2']).reset_index().rename(columns = {'index':'key'})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "但是如果我们想把字典的key和value分别生成两列，如何操作呢? \n",
    "一种方法是："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.DataFrame(list(data.items()),columns = ['key','value'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "还有一种方法依然是利用from_dict,不过就需要将value中的list提前转化成字符串，然后再进行操作即可。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3 np.array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "numpy是比pandas更底层一些的数据操作工具，pandas的很多操作也是基于numpy进行的，比如numpy就支持直接读取txt文件。比如有这样一个txt文件："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%%\n",
    "1 10 0.45240003518120125 1.0000444454321133 0.10599999999999998 1.0599999999999998e-01 0.22999999999999998 0.472   \n",
    "2 20 0.43459179018909283 1.1133165687809157 0.07834109593771774 7.8341095937717736e-02 0.2089183326689947 0.3863815370463022   \n",
    "3 30 0.40767309706715493 1.269342944674328 0.07190653014564094 7.1906530145640940e-02 0.17795528298262073 0.4136993009059622   \n",
    "4 40 0.3859105442514819 1.3433376585083965 0.066153468987387 6.6153468987386999e-02 0.1477849202849159 0.261667203674047   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一共有4行8列的数据，数据间用空格隔开，表头带有%，那么读取的时候可以用loadtxt函数进行导入："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = np.loadtxt('examples/fit.txt', delimiter=None, comments='%',  usecols=(0, 1, 4,5))\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到数据自动剔除了表头，并且只用了其中指定的列。接下来就可以将array导入到pandas中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Data = pd.DataFrame(data, index = np.arange(len(data)), columns=['a','b','c','d'])\n",
    "Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.4 其他方式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然需要导入文本并不规则的时候，可以考虑直接利用python中的文件读取来一行一行的读取文件，然后利用json或者re等字符串处理包来处理数据，最后整合成DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(path, \"r\") as load_f:\n",
    "     l = f.readlines()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当然这个方法要结合具体的数据来看，这里就不展开介绍了。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2 文本格式数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 csv文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.1.1 读取csv文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, indxe_col=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* filepath_or buffer: str, path object or file-like object。指定传入的文件路径，必须传入的参数。\n",
    "* sep: str。指定分隔符，默认是逗号分隔符。\n",
    "* header: int, list or int。指定行数用来作为列名。默认是如果没有传入names参数，则header=0，用第一行作为列名，否则header=None，以传入的names作为列名。另外如果传入的是list，例如[0,1,3]，则是以第1、2、4这些行作为多级列名，且中间的行，第3行会被忽略，数据从第5行开始。\n",
    "* names: array-like, optional。指定文件的列名。如果文件中没有标题行，建议传入此参数。\n",
    "* index_col: int, str, or sequence  of int / str, or False。指定文件的索引，默认为None。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 查看ex1.csv\n",
    "!cat examples/ex1.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入ex1.csv\n",
    "df = pd.read_csv('examples/ex1.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 查看ex2.csv，ex2.csv没有标题行\n",
    "!cat examples/ex2.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 设置sep和header参数，导入ex2.csv\n",
    "df2 = pd.read_csv('examples/ex2.csv',sep='|',header=None)\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 设置sep和names参数，此时header默认为None\n",
    "df3 = pd.read_csv('examples/ex2.csv',sep='|', names=['ID','name','age','city','message'])\n",
    "df3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 对ex1.csv设置多级标题，将第1、2、4行作为标题，数据从第5行开始\n",
    "df4 = pd.read_csv('examples/ex1.csv',header=[0,1,3])\n",
    "df4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入ex1.csv，指定索引为message一列\n",
    "df5 = pd.read_csv('examples/ex1.csv',index_col='ID')\n",
    "df5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入ex1.csv，指定第1和2列作为多重索引\n",
    "df6 = pd.read_csv('examples/ex1.csv',index_col=[0,1])\n",
    "df6"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.1.2 导出csv文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas.to_csv(path_or_buf, index=True, header=True, sep=',', encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* path_or_buf: str or file handle。指定保存文件路径，必须传入的参数，默认为None。\n",
    "* index: bool。导出的csv是否包含索引，默认为True。\n",
    "* header: bool or list of str。导出的csv是否包含标题行，默认为True。\n",
    "* sep: str。指定导出的csv文件的分隔符，默认为逗号分隔符。\n",
    "* encoding: str。指定导出的csv文件的编码，默认为utf-8。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_csv(\"output/out_ex1.csv\",index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 excel文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.2.1 导入excel文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "read_excel和read_csv的用法差不多，一个需要注意的参数是sheet_name。这个参数是指定读取该excel中具体哪个表的数据，默认为0，即为第一个表。如果传入1，则为第2个表；可指定传入表名，如\"Sheet1\"；也可传入多个表，如[0,'Sheet3']，传入第一个表和名为'Sheet3'的表。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取ex1.xlsx文件，默认为读取第一个表\n",
    "df = pd.read_excel(\"examples/ex1.xlsx\")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取ex1.xlsx文件的第2个表\n",
    "df2 = pd.read_excel(\"examples/ex1.xlsx\",sheet_name=1)\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取ex1.xlsx文件的第2个表和名为\"Sheet3\"的表，返回的是对象是OrderedDict\n",
    "# OrderedDict是dict的子类，与dict不同的是，它记住了内容的顺序\n",
    "od = pd.read_excel(\"examples/ex1.xlsx\",sheet_name=[1,'Sheet3'])\n",
    "od"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 在这个orderedDict中，有两个key。\n",
    "# 第一个key是1，对应的value为该表的内容；第二个key是'Sheet3',对应的value是Sheet3表格的内容。\n",
    "# 我们选取key，就能得到相应的value。\n",
    "od[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "od['Sheet3']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.2.2 导出excel文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DataFrame.to_excel(excel_writer, sheet_name='Sheet1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* excel_writer: str。指定保存文件路径。\n",
    "* sheet_name: str。指定excel文件的表名，默认为’Sheet1‘。\n",
    "* index：bool。是否保存索引，默认为True。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_excel('output/out_ex1.xlsx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_excel('output/out_ex2.xlsx',sheet_name='结果',index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 txt文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.3.1 导入txt文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas.read_table(filepath_or_buffer, sep='t', header='infer', names=None, index_col=None）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "read_table与read_csv的区别就是，read_csv默认的sep参数是','，而read_table是't'。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat examples/ex3.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df = pd.read_table('examples/ex3.txt')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = pd.read_table('examples/ex1.csv',sep=',')\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.3.2 导出txt文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用to_csv的方法"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2.to_csv('output/ex3.txt',sep='\\t')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 csv和xlsx的选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当我们可以选择保存为csv或者xlsx格式，方便下次可以使用的时候，是选择保存为csv还是excel呢？除了考虑csv和excel文件大小之外（相同的数据下excel文件比csv文件小），这里可以考虑下read_csv和read_xlsx的性能问题。在stackoverflow上有人对这两种导入方法进行了一个简单的测试。\n",
    "* 测试文件：同样的数据集（分别是320MB的csv文件和16MB的xlsx文件）\n",
    "* 电脑硬件：i7-7700k，SSD\n",
    "* python环境：Anaconda Python 3.5.3, pandas 0.19.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用时:\n",
    "* pd.read_csv('foo,csv')\t用时: 2s\n",
    "* pd.read_excel('foo.xlsx')\t用时: 15.3s\n",
    "* df.to_csv('bar.csv',index=False)\t用时: 10.5s\n",
    "* df.to_excel('bar.xlsx',index=False)\t用时: 34.5s\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.5 json文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.5.1 导入json文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas.read_json(path_or_buf=None, orient=None, typ='frame')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* path_or_buf: 指定文件路径，默认为None，必须传入的参数。\n",
    "* orient: json字符串格式，默认为None。这里有split,records,index,columns,values五种选择可选。\n",
    "    * split: dict like {index -> [index], columns -> [columns], data -> [values]}, 例如下面的ex4.json文件。\n",
    "    * records: list like [{column -> value}, ..., {column -> value}]，例如下面的ex5.json文件。\n",
    "    * index: dict like {index -> {column -> vlaue}}，例如下面的ex6.json文件。\n",
    "    * columns: dict like {column -> {index -> value}}，例如下面的ex7.json文件。\n",
    "    * values: just the values array，例如下面的ex8.json文件。\n",
    "* typ: 要转换为series还是dataframe，默认为frame。当typ=frame时，orient可选split/records/index,默认为columns；当typ=series,orient可选split/records/index/columns/value,orient默认为index。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split格式\n",
    "!cat examples/ex4.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_json('examples/ex4.json',orient=\"split\")\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# records格式\n",
    "!cat examples/ex5.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df1 = pd.read_json('examples/ex5.json',orient=\"records\")\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_json('examples/ex5.json',orient=\"records\",typ=\"series\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# index格式\n",
    "!cat examples/ex6.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df2 = pd.read_json('examples/ex6.json',orient=\"index\")\n",
    "df2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_json('examples/ex6.json',orient=\"index\",typ=\"series\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# columns格式，当typ='frame'时，orient默认为这个格式\n",
    "!cat examples/ex7.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df3 = pd.read_json('examples/ex7.json',orient=\"columns\")\n",
    "df3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df3_0 = pd.read_json('examples/ex7.json')\n",
    "df3_0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用columns格式读取ex6.json，其实与index格式的结果是行列的转置\n",
    "df4 = pd.read_json('examples/ex6.json',orient=\"columns\")\n",
    "df4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# values格式\n",
    "!cat examples/ex8.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df5 = pd.read_json('examples/ex8.json',orient=\"values\")\n",
    "df5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.5.2 导出json文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DataFrame.to_json(path_or_buf=None, orient=None,index=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* orient: string。指定导出json的格式。DataFrame默认是columns，Series默认是index。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_json(\"output/out_ex4.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat output/out_ex4.json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "se = pd.read_json('examples/ex6.json',orient=\"index\",typ=\"series\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "se.to_json(\"output/out_ex5.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat output/out_ex5.json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 数据库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 mysql"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在开始之前，请确保环境中的python为3.x版本，且已经安装并开启mysql服务。这里我们使用pymysql库来连接mysql。首先需要通过pip安装pymysql。安装后，可以通过import语句检验是否已经安装成功。如果没有报错，则说明安装成功。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install pymysql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pymysql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 打开数据库连接\n",
    "# 注意在进行这一步之前要先创建好数据库。如果数据库不存在，这一步会报错。\n",
    "\n",
    "conn = pymysql.connect(host=\"localhost\",user=\"username\",password=\"password\",db=\"dbtest\")\n",
    "\n",
    "# 创建一个游标对象\n",
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1.1 数据导入mysql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 创建数据库表\n",
    "sql_createTb = \"\"\"CREATE TABLE user (\n",
    "                 ID CHAR(4) NOT NULL,\n",
    "                 name CHAR(20),\n",
    "                 age INT,\n",
    "                 city CHAR(20)\n",
    "                 )\n",
    "                 \"\"\"\n",
    "# 执行SQL语句\n",
    "cursor.execute(sql_createTb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 插入数据\n",
    "insert1 = \"INSERT INTO user(ID,name,age,city) values('A001','小明',18,'北京')\"\n",
    "insert2 = \"INSERT INTO user(ID,name,age,city) values('A002','小王',20, '杭州');\"\n",
    "insert3 = \"INSERT INTO user(ID,name,age,city) values('A003','张三',18, '北京');\"\n",
    "insert4 = \"INSERT INTO user(ID,name,age,city) values('A004','张三',18, '北京');\"\n",
    "insert5 = \"INSERT INTO user(ID,name,age,city) values('A005','李四',23, '上海');\"\n",
    "insert6=  \"INSERT INTO user(ID,name,age,city) values('A006','小思',24, '广州');\"\n",
    "\n",
    "\n",
    "# 执行SQL语句\n",
    "for sql_insert in [insert1,insert2,insert3,insert4,insert5,insert6]:\n",
    "    cursor.execute(sql_insert)\n",
    "\n",
    "# pymysql默认是没有开始自动提交事务的\n",
    "# 所以在对更新数据库的时候，一定要手动提交事务\n",
    "# 提交事务\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 更新数据\n",
    "sql_update = \"update user set city='深圳' where ID='A001'\"\n",
    "\n",
    "# 执行SQL语句\n",
    "cursor.execute(sql_update)\n",
    "\n",
    "#提交事务\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 删除数据\n",
    "sql_delete = \"delete from user where ID='A004'\"\n",
    "# 执行SQL语句\n",
    "cursor.execute(sql_delete)\n",
    "# 提交事务\n",
    "conn.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1.2 读取mysql数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 查询数据\n",
    "sql_search  = \"SELECT * FROM user\"\n",
    "\n",
    "# 执行SQL语句\n",
    "cursor.execute(sql_search)\n",
    "\n",
    "# 查看数据\n",
    "results = cursor.fetchall()\n",
    "results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将结果转换dataframe格式\n",
    "df = pd.DataFrame(list(results))\n",
    "df.columns = ['ID','name','age','city']\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 关闭数据库连接\n",
    "cursor.close()\n",
    "conn.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 PostgreSQL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "psycopg2是Python语言的PostgreSQL数据库接口之一，这里我们使用psycopg2连接，首先同样请确保环境中已经安装postgreSQL，以及已通过pip安装psycopg2了。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2.1 读取postgreSQL的数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import psycopg2\n",
    " \n",
    "# 连接数据库\n",
    "conn = psycopg2.connect(database = 'name', user = 'username', password = 'password', host = '10.10.10.10', port = '5432')\n",
    " \n",
    "curs=conn.cursor()\n",
    " \n",
    "# 编写Sql，只取前两行数据\n",
    "sql = 'select * from table_name limit 2'\n",
    " \n",
    "#  数据库中执行sql命令\n",
    "curs.execute(sql)\n",
    " \n",
    "#获得数据\n",
    "data = curs.fetchall()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "返回的data结果是一个以各行数据为元组的列表，如下："
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[('L002','WKQ1','WZ1A','WZ1A','L','WZ01-12',\\\n",
    " '10073864791','R5400','5','18','36','362.29',\\\n",
    "'372.57','351','20190311','20190317','11','0','0',\\\n",
    "'0,8','3','3','0.83','3','3','20190310'),\n",
    "('L002','WKQ1','WZ1A','WZ1A','L','WZ01-14',\\\n",
    " '10073864791','R5400','5','18','36','300.29',\\\n",
    "'372.57','351','20190311','20190317','11','0','0',\\\n",
    "'0,8','3','3','0.83','3','3','20190310')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以通过pandas对data进行进一步处理：pd.DataFrame(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在insert的时候，需要注意以下几点：\n",
    "1. 表中的字段不需要加引号；\n",
    "2. 插入的每行数值用括号包围，其中各个字段以逗号间隔，字符串型必须加引号；\n",
    "3. 以上sql命令可见，一条sql命令可以插入多条数据，只需要连接各个数据，最终commit一次就好；\n",
    "4. 另外在写入PG的时候，应该注意PG中的数据如果出现单引号“ ' ”会出现错误，所以必须先使用replace替换成其他的内容方可写入。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2.2 数据写入postgreSQL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#编辑写入数据的sql    \n",
    "insert_sql = \\\n",
    "\"insert into table_name\n",
    "(warehouse_code,storehouse_code,zone_code,picking_zone_code,picking_type,location_code,\\\n",
    "gds_id,kunnr,yest_tot_qty,week_tot_qty,month_tot_qty,week_avg_prlab,month_avg_prlab,\\\n",
    "present_storage,start_date,end_date,week_flag,unsaleble_flag,super_A,sales_segment,\\\n",
    "present_class,week_avg_qty,month_avg_qty,now_level,pred_level,statis_date,\\\n",
    "prediction_class,action,target_area) \n",
    "values\n",
    " ('L121','6','IWDX','IWDX','L','IWDX-001-0101','120339999','S70227820','0','0','2','1','1','1','20190311','20190317','11','0','0','0,7','1','0','0','1','1','20190310','2','2','targ_null'),\\\n",
    " ('L002','WKQ1','WZ1A','WZ1A','L','WZ0114','10073864791','R5400','5','18','36','300.29',\\\n",
    "'372.57','351','20190311','20190317','11','0','0',\\\n",
    "'0,8','3','3','0.83','3','3','20190310')\"\n",
    "        \n",
    "curs.execute(insert_sql)\n",
    " \n",
    "#提交数据    \n",
    "conn.commit()\n",
    "#关闭指针和数据库\n",
    "curs.close()\n",
    "conn.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 sqlalchemy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先需要先通过pip安装sqlalchemy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import create_engine\n",
    "\n",
    "# 数据库名字\n",
    "db = 'dbtest'\n",
    "\n",
    "# user为用户名，password为密码\n",
    "engine = create_engine(\"mysql+pymysql://username:password@localhost:3306/%s?charset=utf8mb4\" % db, echo=False)\n",
    "\n",
    "# 如果需要连接其他数据库，需要更改这里的create_engine\n",
    "# 如postgresql:create_engine('postgresql://username:password@localhost/mydatabase')\n",
    "# 具体可以查看 https://docs.sqlalchemy.org/en/13/core/engines.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3.1 读取数据库数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在pandas中，我们可以通过read_sql_table和read_sql的方法来读取数据库，pandas会帮我们将结果直接转为dataframe的格式，这对于需要dataframe格式数据的来说是非常方便的。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取某个表的数据\n",
    "pandas.read_sql_table(table_name, con)\n",
    "# 查询sql\n",
    "pandas.read_sql(table_name, con)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取数据库下现有所有的表\n",
    "pd.read_sql('show tables', engine)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取数据下某个表的数据\n",
    "pd.read_sql_table('user', engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3.2 数据导入数据库"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将dataframe格式的数据导入到数据库中，我们可以使用to_sql的方法。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DataFrame.to_sql(name, con, if_exists='fail')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* name：导入库的表的名字\n",
    "* if_exists：默认为\"fail\"，表示如果表不存在，直接报错；可选\"replace\"，导入的dataframe直接覆盖该表；可选\"append\"，将数据添加到表的后面。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('examples/sql.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将df导入数据库中\n",
    "table_name = 'user2'\n",
    "df.to_sql(table_name, engine, if_exists='append', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_sql_table('user2',engine)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4 MongoDB|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "MongoDB 是目前最流行的 NoSQL 数据库之一，使用的数据类型 BSON（类似 JSON）。这里我们使用PyMongo连接MongoDB数据库。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入前需要pip安装pymongo，并开始mongoDB服务\n",
    "from pymongo import MongoClient\n",
    "\n",
    "# 连接mongoDB数据库 \n",
    "myclient = MongoClient('mongodb://localhost:27017/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4.1 将数据导入数据库"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 创建一个集合\n",
    "mydb = myclient[\"dbtest\"]\n",
    "mycol = mydb[\"user\"]\n",
    "\n",
    "\n",
    "# 导入一条数据,data的格式为{col:value}\n",
    "data_one = {\"ID\":\"A011\",\"name\":\"小黑\",\"age\":18,\"city\":\"深圳\"}\n",
    "mycol.insert_one(data_one)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入多条数据\n",
    "data_many = [{\"ID\":\"A012\",\"name\":\"小红\",\"age\":23,\"city\":\"深圳\"},\n",
    "{\"ID\":\"A013\",\"name\":\"小白\",\"age\":30,\"city\":\"深圳\"},\n",
    "{\"ID\":\"A014\",\"name\":\"小蓝\",\"age\":24,\"city\":\"深圳\"}]\n",
    "mycol.insert_many(data_many)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将dataframe转为json后导入mongo\n",
    "import json\n",
    "def df2mongo(df, col):\n",
    "    records = json.loads(df.T.to_json()).values()\n",
    "    result = col.insert_many(records)\n",
    "    return result\n",
    "df = pd.read_csv('examples/sql.csv')\n",
    "df2mongo(df,mycol)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4.2 读取数据库数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 读取mongo某个集合的数据\n",
    "mycol.find()\n",
    "\n",
    "# 读取mongo某个集合的数据，并转为dataframe数据格式\n",
    "df = pd.DataFrame(list(mycol.find()))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 指定条件查询,返回所有符合条件的数据 \n",
    "myquery = { \"city\": \"上海\" }\n",
    "mydoc = mycol.find(myquery)\n",
    "list(mydoc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 指定条件查询,返回符合条件的指定条件数据 \n",
    "mydoc = mycol.find(myquery).limit(1)\n",
    "for x in mydoc:\n",
    "    print(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "myquery = { \"age\": { \"$gt\": 24 } }\n",
    " \n",
    "mydoc = mycol.find(myquery)\n",
    " \n",
    "for x in mydoc:\n",
    "    print(x)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
